Software tool for detecting plagiarism in computer source code

ABSTRACT

Plagiarism of software source code is a serious problem in two distinct areas of endeavor: cheating by students at schools and intellectual property theft at corporations. A number of algorithms have been implemented to check source code files for plagiarism, each with its own strengths and weaknesses. This invention consists of a combination of algorithms in a single software program to assist a human expert in finding plagiarized code.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to software tools for comparing program source code files to determine the amount of similarity between the files and to pinpoint specific sections that are similar. In particular, the present invention relates to finding pairs of source code files that have been copied, in full or in part, from each other or from a common third file.

2. Discussion of the Related Art

Plagiarism detection programs and algorithms have been around for a number of years but have gotten more attention recently due to two main factors. One reason is that the Internet and search engines like Google have made source code very easy to obtain. Another reason is the growing open source movement that allows programmers all over the world to write, distribute, and share code. It follows that plagiarism detection programs have become more sophisticated in recent years. An excellent summary of available tools is given by Paul Clough in his paper, “Plagiarism in natural and programming languages: an overview of current tools and technologies.” Clough discusses tools and algorithms for finding plagiarism in generic text documents as well as in programming language source code files. The present invention only relates to tools and algorithms for finding plagiarism in programming language source code files and so the discussion will be confined to those types of tools. Following are brief descriptions of four of the most popular tools and their algorithms.

The Plague program was developed by Geoff Whale at the University of New South Wales. Plague uses an algorithm that creates what is called a structure-metric, based on matching code structures rather than matching the code itself. The idea is that two pieces of source code that have the same structures are likely to have been copied. The Plague algorithm ignores comments, variable names, function names, and other elements that can easily be globally or locally modified in an attempt to fool a plagiarism detection tool.

Plague has three phases to its detection, as illustrated in FIG. 1:

1. In the first phase 101, a sequence of tokens and structure metrics are created to form a structure profile for each source code file. In other words, each program is boiled down to basic elements that represent control structures and data structures in the program.
2. In the second phase 102, the structure profiles are compared to find similar code structures. Pairs of files with similar code structures are moved into the next stage.
3. In the final stage 103, token sequences within matching source code structures are compared using a variant of the Longest Common Subsequence (LCS) algorithm to find similarity.
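The third phase compares token sequences with a variant of LCS. The source does not say which variant Plague uses, so the following is only a minimal sketch of the textbook LCS length computation over two token lists. The function name `lcs_length` and the sample tokens are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.

    Textbook dynamic programming; the exact variant used by Plague is
    not described in the source, so this is an illustrative assumption.
    """
    # prev[j] is the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical token sequences for two files.
file1 = ["if", "for", "assign", "call", "return"]
file2 = ["for", "if", "assign", "return"]
print(lcs_length(file1, file2))
```

Because a subsequence must keep its order, swapping two tokens lowers the score, which is consistent with the sensitivity to reordered lines discussed for Plague.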

Clough points out three problems with Plague:

1. Plague is hard to adapt to new programming languages because it is so dependent on expert knowledge of the programming language of the source code it is examining. The tokens depend on specific language statements and the structure metrics depend on specific programming language structures.
2. The output of Plague consists of two indices, H and HT, that need interpretation. While the output of each plagiarism detection program presented here relies on expert interpretation, results from Plague are particularly obscure.
3. Plague uses UNIX shell tools for processing, which makes it slow. This is not an innate problem with the algorithm, which can be ported to compiled code for faster processing.

There are other problems with Plague:

1. Plague is vulnerable to changing the order of code lines in the source code.
2. Plague throws out useful information when it discards comments, variable names, function names, and other identifiers.

The first point is a problem because code sections can be rearranged and individual lines can be reordered to fool Plague into giving lower scores or missing copied code altogether. This is one method that sophisticated plagiarists use to hide malicious code theft.

The second point is a problem because comments, variable names, function names, and other identifiers can be very useful in finding plagiarism. These identifiers can pinpoint copied code immediately. Even in many cases of intentional copying, comments are left in the copied code and can be used to find matches. Common misspellings or the use of particular words throughout the program in two sets of source code can help identify them as having the same author even if the code structures themselves do not match. As we will see, this is a common problem with these plagiarism tools.

The YAP programs (YAP, YAP2, YAP3) were developed by Michael Wise at the University of Sydney, Australia. YAP stands for “Yet Another Plague” and is an extension of Plague. All three versions of YAP use algorithms, illustrated in FIG. 2, that can generally be described in two phases as follows:

1. In the first phase 201, generate a list of tokens for each source code file.
2. In the second phase 202, compare pairs of token files.

The first phase of the algorithm is identical for all three programs. The steps of this phase, illustrated in FIG. 2, are:

1. In step 203, remove comments and string constants.
2. In step 204, translate upper-case letters to lower-case.
3. In step 205, map synonyms to a common form. In other words, substitute a basic set of programming language statements for common, nearly equivalent statements. As an example using the C language, the language keyword “strncmp” would be mapped to “strcmp”, and the language keyword “function” would be mapped to “procedure”.
4. In step 206, reorder the functions into their calling order. The first call to each function is expanded inline and tokens are substituted appropriately. Each subsequent call to the same function is simply replaced by the token FUN.
5. In step 207, remove all tokens that are not specifically programming language keywords.

The second phase 202 of the algorithm is identical for YAP and YAP2. YAP relied on the sdiff utility in UNIX to compare lists of tokens for the longest common sequence of tokens. YAP2, implemented in Perl, improved performance in the second phase 202 by utilizing a more sophisticated algorithm known as Heckel's algorithm. One limitation of YAP and YAP2 that was recognized by Wise was difficulty dealing with transposed code. In other words, functions or individual statements could be rearranged to hide plagiarism. So for YAP3, the second phase uses the Running-Karp-Rabin Greedy-String-Tiling (RKR-GST) algorithm, which is more immune to tokens being transposed.

YAP3 is an improvement over Plague in that it does not attempt a full parse of the programming language as Plague does. This simplifies the task of modifying the tool to work with other programming languages. Also, the new algorithm is better able to find matches in transposed lines of code.

There are still problems with YAP3 that need to be noted:

1. In order to decrease the run time of the program, the RKR-GST algorithm uses hashing and only considers matches of strings of a minimal length. This opens up the algorithm to missing some matches.
2. The tokens used by YAP3 are still dependent on knowledge of the particular programming language of the files being compared.
3. Although less so than Plague, YAP3 is still vulnerable to changing the order of code lines in the source code.
4. YAP3 throws out much useful information when it discards comments, variable names, function names, and other identifiers that can be, and have been, used to find source code with common origins.

JPlag is a program, written in Java by Lutz Prechelt and Guido Malpohl of the University of Karlsruhe and Michael Philippsen of the University of Erlangen-Nuremberg, to detect plagiarism in Java, Scheme, C, or C++ source code. Like other plagiarism detection programs, JPlag works in phases as illustrated in FIG. 3:

1. There are two steps in the first phase 301. In the first step 303, whitespace, comments, and identifier names are removed. As with Plague and the YAP programs, in the second step 304, the remaining language statements are replaced by tokens.
2. As with YAP3, the method of Greedy String Tiling is used to compare tokens in different files in the second phase 302. More matching tokens correspond to a higher degree of similarity and a greater chance of plagiarism.

As can be seen from the description, JPlag is nearly identical in its algorithm to YAP3, though it uses different optimization procedures for reducing runtime. One difference is that JPlag produces a very nice HTML output with detailed plots comparing file similarities. It also allows the user to click on a file combination to bring up windows showing both files with areas of similarity highlighted. The limitations of JPlag are the same as those listed previously for YAP3.

The Measure of Software Similarity (MOSS) program was developed at the University of California at Berkeley by Alex Aiken. MOSS uses a winnowing algorithm. The MOSS algorithm can be described by these steps, as illustrated in FIG. 4:

1. In the first step 401, remove all whitespace and punctuation from each source code file and convert all characters to lower case.
2. In the second step 402, divide the remaining non-whitespace characters of each file into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file. In this way the second character of the first k-gram is the first character of the second k-gram and so on.
3. In the third step 403, hash each k-gram and select a subset of all k-grams to be the fingerprints of the document. The fingerprint includes information about the position of each selected k-gram in the document.
4. In the fourth step 404, compare file fingerprints to find similar files.
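The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the MOSS implementation: the text does not say which hash function is used or how the subset of k-grams is selected, so the sketch uses `zlib.crc32` and keeps the minimum hash in each sliding window of hashes. The names `normalize`, `kgrams`, `fingerprint`, `shared_hashes`, and the `window` parameter are hypothetical.

```python
import string
import zlib


def normalize(text):
    """Step 401: drop whitespace and punctuation, lower-case the rest."""
    drop = set(string.whitespace + string.punctuation)
    return "".join(ch.lower() for ch in text if ch not in drop)


def kgrams(text, k=5):
    """Step 402: every contiguous substring of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def fingerprint(text, k=5, window=4):
    """Step 403: hash each k-gram and keep a subset, with positions.

    Assumed selection rule: keep the minimum hash in each window of
    `window` consecutive hashes, taking the rightmost one on ties.
    """
    hashes = [zlib.crc32(g.encode("utf-8")) for g in kgrams(normalize(text), k)]
    selected = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        offset = window - 1 - win[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def shared_hashes(text1, text2):
    """Step 404: hashes that occur in both fingerprints."""
    h1 = {h for h, _ in fingerprint(text1)}
    h2 = {h for h, _ in fingerprint(text2)}
    return h1 & h2


sample = "A do run run run, a do run run"
print(kgrams(normalize(sample))[:3])
print(len(shared_hashes(sample, sample)))
```

Note that a text whose normalized form yields fewer than `window` k-grams produces an empty fingerprint in this sketch, which illustrates how short fragments can be missed.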

An example of the algorithm for creating these fingerprints is shown in FIG. 5. Some text to be compared is shown in part (a) 501. The 5-grams derived from the text are shown in part (b) 502. A possible sequence of hashes is shown in part (c) 503. A possible selection of hashes chosen to be the fingerprint for the text is shown in part (d) 504. The concept is that the hash function is chosen so that the probability of collisions is very small, so that whenever two documents share fingerprints, it is extremely likely that they share k-grams as well and thus contain plagiarized code.

Of all the programs discussed here, MOSS throws out the most information. The algorithm attempts to keep enough critical information to flag similarities. The algorithm is also noted to have a very low occurrence of false positives. The problem with using this algorithm for detecting source code plagiarism is that it produces a high occurrence of false negatives. In other words, matches can be missed. The reasons for this are as follows:

1. By treating source code files like generic text files, much structural information is lost that can be used to find matches. For example, whitespace, punctuation, and uppercase characters have significant meaning in programming languages but are thrown out by MOSS.
2. Smaller k-grams increase the sensitivity but also increase the execution time of the program. MOSS trades some sensitivity for speed and typically uses 5-grams. However, many programming language statements are shorter than 5 characters and can be missed.
3. Most of the k-grams are also thrown out, reducing the accuracy even further.

SUMMARY OF THE INVENTION

Plagiarism of software source code is a serious problem in two distinct areas of endeavor these days: cheating by students at schools and intellectual property theft at corporations. A number of methods have been implemented to check source code files for plagiarism, each with its own strengths and weaknesses. The present invention is a new method consisting of a combination of algorithms in a single tool to assist a human expert in finding plagiarized code. The present invention uses five algorithms to find plagiarism: Source Line Matching, Comment Line Matching, Word Matching, Partial Word Matching, and Semantic Sequence Matching.

Further features and advantages of various embodiments of the present invention are described in the detailed description below, which is given by way of example only.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of the preferred embodiment of the invention, which, however, should not be taken to limit the invention to the specific embodiment but are for explanation and understanding only.

FIG. 1 illustrates the algorithm used by the Plague program for source code plagiarism detection.

FIG. 2 illustrates the algorithm used by the YAP, YAP2, and YAP3 programs for source code plagiarism detection.

FIG. 3 illustrates the algorithm used by the JPlag program for source code plagiarism detection.

FIG. 4 illustrates the algorithm used by the MOSS program for source code plagiarism detection.

FIG. 5 illustrates the fingerprinting algorithm used by the MOSS program for source code plagiarism detection.

FIG. 6 illustrates dividing a file of source code into source lines, comment lines, and words.

FIG. 7 illustrates matching partial words in a pair of files.

FIG. 8 illustrates matching source lines in a pair of files.

FIG. 9 illustrates matching comment lines in a pair of files.

FIG. 10 illustrates the sequence of algorithms comprising the present invention.

FIG. 11 shows a sample basic report output.

FIG. 12 shows a sample detailed report output.

DETAILED DESCRIPTION

The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of the preferred embodiment of the invention, which, however, should not be taken to limit the invention to the specific embodiment but are for explanation and understanding only.

The present invention takes a different approach to plagiarism detection than the programs described previously. The present invention compares features of each pair of source code files completely, rather than using a sampling method for comparing a small number of hashed samples of code. This may require a computer program that implements the present invention to run for hours or in some cases days to find plagiarism among large sets of large files. Given the stakes in many intellectual property theft cases, this more accurate method is worth the processing time involved. And it is certainly less expensive than hiring experts on an hourly basis to pore over code by hand.

The present invention makes use of a basic knowledge of programming languages and program structures to simplify the matching task. There is a small amount of information needed in the form of a list of common programming language statements that the present invention must recognize. This list is specific to the programming language being examined. In addition, the present invention needs information on characters that are used to identify comments and characters that are used as separators.

The present invention uses five algorithms to find plagiarism: Source Line Matching, Comment Line Matching, Word Matching, Partial Word Matching, and Semantic Sequence Matching. Each algorithm is useful in finding different clues to plagiarism that the other algorithms may miss. By using all five algorithms, the chance of missing plagiarized code is significantly diminished. Before any of the algorithm processing takes place, some preprocessing is done to create string arrays. Each file is represented by three arrays: an array of source lines that consists of lines of functional source code and does not include comments, an array of comment lines that do not include functional source code, and an array of identifiers found in the source code. Identifiers include variable names, constant names, function names, and any other words that are not keywords of the programming language.

In one embodiment of the present invention, each line of each file is initially examined and two string arrays for each file are created: SourceLines1[ ] and CommentLines1[ ] are the source lines and comment lines for file 1, and SourceLines2[ ] and CommentLines2[ ] are the source lines and comment lines for file 2. Examples of these arrays are shown for a sample code snippet in FIG. 6. A sample snippet of a source code file to be examined is shown in part (a) 601. The separation of source lines and comment lines for the code snippet is shown in part (b) 602. Note that whitespace is not removed entirely, but rather all sequences of whitespace characters are replaced by a single space in both source lines and comment lines. In this way, the individual words are preserved in the strings. Separator characters such as {, }, and ; are treated as whitespace. The comment characters themselves, in this case /*, */, and //, are stripped off from the comments. We are only interested in the content of each comment, not the layout of the comment. Special characters such as comment delimiters and separator characters are defined in a language definition file that is input to this embodiment of the present invention.

Note that blank lines are preserved as null strings in the array. This is done so that the index in each array corresponds to the line number in the original file and matching lines can easily be mapped back to their original files.
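The preprocessing described above can be sketched as follows for C-style comments. This is a minimal illustration, not the embodiment's actual code: the function names `squeeze` and `split_lines` are hypothetical, the separators and comment delimiters are hard-coded here rather than read from a language definition file, and comment markers inside string literals are not handled.

```python
import re

# Assumption: in the described embodiment these come from a language
# definition file; they are hard-coded here for illustration.
SEPARATORS = "{};"


def squeeze(text):
    """Treat separators as whitespace, then collapse whitespace runs."""
    for ch in SEPARATORS:
        text = text.replace(ch, " ")
    return re.sub(r"\s+", " ", text).strip()


def split_lines(lines):
    """Return (source_lines, comment_lines), one entry per input line.

    Blank entries are kept as empty strings so that list index + 1 is
    the line number in the original file.
    """
    source_lines, comment_lines = [], []
    in_block = False  # True while inside an unterminated /* ... */ comment
    for line in lines:
        code, comment = "", ""
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    comment += line[i:]
                    break
                comment += line[i:end] + " "
                in_block = False
                i = end + 2
            elif line.startswith("//", i):
                comment += line[i + 2:]
                break
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                code += line[i]
                i += 1
        source_lines.append(squeeze(code))
        comment_lines.append(squeeze(comment))
    return source_lines, comment_lines


snippet = [
    "/* Add one */",
    "int inc(int count) {",
    "    return count + 1;  // bump",
    "",
    "}",
]
src, com = split_lines(snippet)
print(src)
print(com)
```

Both lists have the same length as the input, so a match at index n can be reported as line n + 1 of the original file.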

Next the source lines are examined from each file to obtain a list of all words in the source code that are not programming language keywords, as shown in part (c) 603 of FIG. 6. Note that identifier j is not listed as an identifier because all 1-character words are ignored as too common to consider. At this point, this embodiment of the present invention is ready to begin applying the matching algorithms.
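A rough sketch of this identifier extraction is shown below. The keyword set is abbreviated and the regular expression used to pick out words is an assumption; the described embodiment reads keywords and separators from a language definition file. The name `extract_identifiers` and the `case_sensitive` parameter name are hypothetical, the latter standing in for the case-sensitivity switch described for keyword recognition.

```python
import re

# Abbreviated, illustrative keyword list (assumption, not a full C list).
C_KEYWORDS = {"int", "char", "void", "return", "for", "while", "if", "else"}


def extract_identifiers(source_lines, keywords=C_KEYWORDS, case_sensitive=True):
    """Collect non-keyword words of two or more characters, in first-seen order.

    Assumes `keywords` is given in lower case when case_sensitive is False.
    """
    seen = []
    for line in source_lines:
        for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
            key = word if case_sensitive else word.lower()
            if len(word) < 2 or key in keywords:
                continue  # 1-character words and keywords are ignored
            if word not in seen:
                seen.append(word)
    return seen


lines = ["int inc(int count)", "return count + 1", "", "for (j = 0 j < While j++)"]
print(extract_identifiers(lines))
print(extract_identifiers(lines, case_sensitive=False))
```

With case sensitivity on, “While” is kept as an identifier; with it off, “While” is recognized as the keyword “while” and dropped.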

Word Matching

For each file pair, this embodiment of the present invention uses a “word matching” algorithm to count the number of matching identifiers, identifiers being words that are not programming language keywords. In order to determine whether a word is a programming language keyword, comparison is done with a list of known programming language keywords. For example, the word “while” in a C source code file would be ignored as a keyword by this algorithm. In some programming languages like C and Java, keywords are case sensitive. In other programming languages like Basic, keywords are not case sensitive. This embodiment has a switch to turn case sensitivity on or off depending on the programming language being examined. So for a case sensitive language like C, the word “While” would not be considered a language keyword and would not be ignored. In a case insensitive language like Basic, the word “While” would be considered a language keyword and would be ignored. In either case, when comparing non-keyword words in the file pairs, case is ignored so that the word “Index” in one file would be matched with the word “index” in the other. This case-insensitive comparison is done to prevent being fooled by simple case changes in plagiarized code in an attempt to avoid detection.

This simple comparison yields a number w representing the number of matching identifier words in the source code of the pair of files. This number is determined by the equation

$$w = \sum_{i=1}^{m_w} \left( A_i + f_N N_i \right)$$

where $m_w$ is the number of case-insensitive matching non-keyword words in the two files, $A_i$ is the number of matching alphabetical characters in matching word i, $N_i$ is the number of matching numerals in matching word i, and $f_N$ is a fractional value given to matching numerals in a matching word. The reason for this fractional value is that alphabetical characters are less likely to match by chance, but numerals may match simply because they represent common mathematical constants (the value of pi, for example) rather than because of plagiarism. Longer sequences of letters and/or numerals have a smaller probability of matching by chance and therefore deserve more consideration as potential plagiarism.
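The equation for w can be illustrated with a short sketch. The value `f_n=0.5` is a hypothetical choice for $f_N$, since the source only says it is fractional, and the function names are illustrative. Characters that are neither letters nor digits, such as underscores, contribute nothing here, which is also an assumption.

```python
def word_weight(word, f_n):
    """A_i + f_N * N_i for one matching word."""
    letters = sum(ch.isalpha() for ch in word)
    digits = sum(ch.isdigit() for ch in word)
    return letters + f_n * digits


def word_match_score(words1, words2, f_n=0.5):
    """Sum the weights of all case-insensitive matching identifiers."""
    set1 = {w.lower() for w in words1}
    set2 = {w.lower() for w in words2}
    return sum(word_weight(w, f_n) for w in set1 & set2)


# "index" matches with weight 5; "buf2" matches with weight 3 + 0.5 * 1.
print(word_match_score(["Index", "buf2", "total"], ["index", "BUF2", "count"]))
```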

The word matching algorithm tends to uncover code where common identifier names are used for variables, constants, and functions, implying that the code was plagiarized. Since this algorithm only eliminates standard programming language statements, common library routines that are used in both files will produce a high value of w. Code that uses a large number of the same library routines also has a higher chance of being plagiarized code.

Partial Word Matching

The “partial word matching” algorithm examines each identifier (non-keyword) word in the source code of one file of a file pair and finds all words that match a sequence within one or more non-keyword words in the other file of a file pair. Like the word matching algorithm, this one is also case insensitive. This algorithm is illustrated in FIG. 7. In part (a) 701, the non-keyword words from the two files are displayed. In part (b) 702, every word from one file that can be found as a sequence within a word from the other file is listed. So the identifier “abc” in file 1 can be found within identifiers “aabc”, “abc1111111”, and “abcxxxyz” in file 2. Note that identifier “pdq” is not listed in the array of partially matching words because it matches completely and was already considered in the word matching algorithm. Also note that identifier “x” is not listed in the array because 1-character words are ignored.
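A minimal sketch of finding partially matching words is given below. The file contents are modeled on the identifiers named in the text; placing “pdq” and “x” in both files is an assumption about FIG. 7. The sketch also assumes that a word matching completely in both files is never reported as a partial match. The name `partial_matches` is hypothetical.

```python
def partial_matches(words1, words2):
    """Words from either file found inside a longer word of the other file."""
    lower1 = {w.lower() for w in words1 if len(w) > 1}
    lower2 = {w.lower() for w in words2 if len(w) > 1}
    exact = lower1 & lower2  # already counted by word matching
    found = set()
    for pool, other_pool in ((lower1, lower2), (lower2, lower1)):
        for word in pool - exact:
            if any(word in other and word != other for other in other_pool):
                found.add(word)
    return sorted(found)


file1_words = ["abc", "pdq", "x"]
file2_words = ["aabc", "abc1111111", "abcxxxyz", "pdq", "x"]
print(partial_matches(file1_words, file2_words))
```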

The partial word matching algorithm then scores the list of partially matching words just as the word matching algorithm scores matching words. It yields a number p representing the number of partially matching identifier words in the source code of the pair of files. This number is determined by the equation

$$p = \sum_{i=1}^{m_p} \left( A_i + f_N N_i \right)$$

where $m_p$ is the number of case-insensitive matching partial words in the two files, $A_i$ is the number of matching alphabetical characters in matching partial word i, $N_i$ is the number of matching numerals in matching partial word i, and $f_N$ is a fractional value given to matching numerals in a matching partial word.

Source Line Matching

The “source line matching” algorithm compares each line of source code from both files, ignoring case. We refer to functional program language lines as source lines and exclude comment lines. Also, sequences of whitespace are converted to single spaces so that the syntax structure of the line is preserved. Note that a line of source code may have a comment at the end, in which case the comment is stripped off for this comparison. Source lines that contain only programming language keywords are not examined. For source lines to be considered matches, they must contain at least one non-keyword such as a variable name or function name. Otherwise, lines containing basic operations would be reported as matching. FIG. 8 illustrates this algorithm. Part (a) 801 shows the lines of two files along with line numbers. Part (b) 802 shows the source line numbers in the two files that are considered matching.

This algorithm yields a number s representing the number of matching source lines in the pair of files.
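Source line matching could be sketched as below, operating on already-preprocessed source line arrays. The helper names, the word regular expression, and the abbreviated keyword set are assumptions. The sketch returns pairs of 1-based line numbers; how repeated matching lines should be counted toward s is not specified in the text.

```python
import re


def has_identifier(line, keywords):
    """True if the line contains at least one word that is not a keyword."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line)
    return any(w not in keywords for w in words)


def match_source_lines(lines1, lines2, keywords):
    """Return (line in file 1, line in file 2) pairs that match, ignoring case."""
    index2 = {}
    for n2, line in enumerate(lines2, start=1):
        index2.setdefault(line.lower(), []).append(n2)
    matches = []
    for n1, line in enumerate(lines1, start=1):
        if not has_identifier(line, keywords):
            continue  # blank or keyword-only lines are not examined
        for n2 in index2.get(line.lower(), []):
            matches.append((n1, n2))
    return matches


keywords = {"int", "return", "else"}  # abbreviated, illustrative
lines1 = ["", "int inc(int count)", "return count + 1", "else"]
lines2 = ["INT INC(int count)", "else", "", "return count + 1"]
print(match_source_lines(lines1, lines2, keywords))
```

The keyword-only line “else” appears in both files but is not reported, as the paragraph above requires.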

Comment Line Matching

The “comment line matching” algorithm compares each line of comments from both files, again ignoring case. Note that a line of source code may have a comment at the end. The source code is stripped off for this comparison, leaving only the comment. The entire comment is compared, regardless of whether there are keywords in the comment or not. FIG. 9 shows two files along with line numbers and the comment lines that are considered matching. Part (a) 901 shows the lines of two files along with line numbers. Part (b) 902 shows the comment line numbers in the two files that are considered matching.

This algorithm yields a number c representing the number of matching comment lines in the pair of files.

Semantic Sequence Matching

The “semantic sequence” algorithm compares the first word of every source line in the pair of files, ignoring blank lines and comment lines. This algorithm finds sequences of code that appear to perform the same functions despite changed comments and identifier names. The algorithm finds the longest common semantic sequence within both files. Look at the example code in FIG. 9 part (a) 901. In this case, the semantic sequence of lines 2 through 9 in file 1 matches the semantic sequence of lines 2 through 8 in file 2 because the first word in each non-blank line in file 1 is identical to the first word of the corresponding line in file 2. There are 6 source lines in this sequence, so the algorithm yields a value of 6. If a longer sequence of source lines is found in the file, this algorithm returns the number of source lines in the longer sequence. This algorithm yields a number q representing the number of lines in the longest matching semantic sequence in the pair of files.
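One way to compute q is a longest-common-run search over the first words of the non-blank source lines. The following is a sketch under that assumption; the text does not describe the search procedure itself. The function names are hypothetical, and case is ignored here, which the claims treat as an optional refinement.

```python
def first_words(source_lines):
    """First word of every non-blank source line, lower-cased."""
    return [line.split()[0].lower() for line in source_lines if line.strip()]


def longest_semantic_sequence(lines1, lines2):
    """Length of the longest run of consecutive matching first words."""
    a, b = first_words(lines1), first_words(lines2)
    best = 0
    # prev[j] is the length of the matching run ending at the previous
    # word of `a` and at b[j - 1].
    prev = [0] * (len(b) + 1)
    for word_a in a:
        curr = [0] * (len(b) + 1)
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Source line arrays after preprocessing; empty strings are blank or
# comment-only lines in the original files.
lines1 = ["int total", "", "for (i = 0 i < n i++)", "if (i > 2)", "return total"]
lines2 = ["char name", "int sum", "for (k = 0 k < n k++)", "", "if (k > 2)", "return sum"]
print(longest_semantic_sequence(lines1, lines2))
```

The identifiers differ between the two files, yet the run int, for, if, return is still found because only the first word of each line is compared.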

Match Score

The entire sequence, applying all five algorithms, is shown in FIG. 10. In the first step 1001, the source line, comment line, and word arrays for the two files to be compared are created. In the second step 1002, the source line arrays of the two files are compared using the source line matching algorithm. In the third step 1003, the comment line arrays of the two files are compared using the comment line matching algorithm. In the fourth step 1004, the word arrays of the two files are compared using the word matching algorithm. In the fifth step 1005, the word arrays of the two files are compared using the partial word matching algorithm. In the sixth step 1006, the source line arrays of the two files are compared using the semantic sequence matching algorithm. Although all matching algorithms produce output for the user, in the seventh step 1007, the results of all matching algorithms are combined into a single match score.

The single match score t is a measure of the similarity of the file pairs. If a file pair has a higher score, it implies that these files are more similar and may be plagiarized from each other or from a common third file. This score, known as a “total match score,” is given by the following equation.

$$t = k_w w + k_p p + k_s s + k_c c + k_q q$$

In this equation, each of the results of the five individual algorithms is weighted and added to give a total matching score. These weights must be adjusted to give the optimal results. There is also a sixth weight that is hidden in the above equation and must also be evaluated. That weight is $f_N$, the fractional value given to matching numerals in a matching word or partial word. Thus the weights that must be adjusted to get a useful total matching score are:

- $f_N$: the fractional value given to matching numerals in a matching word or partial word
- $k_w$: the weight given to the word matching algorithm
- $k_p$: the weight given to the partial word matching algorithm
- $k_s$: the weight given to the source line matching algorithm
- $k_c$: the weight given to the comment line matching algorithm
- $k_q$: the weight given to the semantic sequence matching algorithm
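As a worked illustration of the total match score, the sketch below combines five algorithm results with a set of weights. All numeric values are hypothetical; the source gives no weights and states only that they are tuned by experimentation.

```python
def total_match_score(w, p, s, c, q, weights):
    """t = k_w*w + k_p*p + k_s*s + k_c*c + k_q*q"""
    return (weights["k_w"] * w + weights["k_p"] * p + weights["k_s"] * s
            + weights["k_c"] * c + weights["k_q"] * q)


# Hypothetical weights and per-algorithm results for one file pair.
weights = {"k_w": 1.0, "k_p": 0.5, "k_s": 4.0, "k_c": 4.0, "k_q": 2.0}
t = total_match_score(w=8.5, p=3.0, s=2, c=1, q=4, weights=weights)
print(t)  # 8.5 + 1.5 + 8.0 + 4.0 + 8.0
```

Computing t for every file pair and sorting the pairs in descending order of t gives the kind of ranking the basic report presents.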

The weights are adjusted by experimentation over time to give the best results. However, unlike the other programs described here, this invention is not intended to give a specific cutoff threshold for file similarity. There are many kinds of plagiarism and many ways of fooling plagiarism detection programs. For this reason, this embodiment of the present invention produces a basic HTML output report with a list of file pairs ordered by their total match scores as shown in FIG. 11. This basic report includes a header 1101 and a ranking of file pair matches for each file as shown in 1102 and 1103. Each match score shown is also a hyperlink.

The user can click on a match score hyperlink to bring up a detailed HTML report showing exact matches between the selected file pairs. In this way, experts are directed to suspicious similarities and allowed to make their own judgments. A sample detailed report is shown in FIG. 12. The report includes a header 1201 that tells which files are being compared. The exact matching source lines and the corresponding line numbers are given in the next section 1202. The exact matching comment lines and the corresponding line numbers are given in the next section 1203. The number of lines in the longest matching semantic sequence and the beginning line numbers for the sequence in each file are given in the next section 1204. The matching words in the files are shown in the next section 1205. The matching partial words in the files are shown in the next section 1206.

The present invention is not a tool for precisely pinpointing plagiarized code, but rather a tool to assist an expert in finding plagiarized code. The present invention reduces the effort needed by the expert by allowing him to narrow his focus from hundreds of thousands of lines in hundreds of files to dozens of lines in dozens of files.

Various modifications and adaptations of the operations that are described here would be apparent to those skilled in the art based on the above disclosure. Many variations and modifications within the scope of the invention are therefore possible. The present invention is set forth by the following claims.

1) A method for comparing two program source code files to help anexpert determine whether one file contains source code that has beencopied from the other file or whether both files contain code that hasbeen copied from a third file, the method comprising a) eliminatingprogramming comments from the first source code file; b) eliminatingprogramming comments from the second source code file; c) substituting asingle space character for sequences of whitespace characters in eachremaining line of functional programming code in said first file; d)substituting a single space character for sequences of whitespacecharacters in each remaining line of functional programming code in saidsecond file; e) putting each remaining line of functional programmingcode of the first file into an array of text strings; f) putting eachremaining line of functional programming code of the second file into asecond array of text strings; and g) finding all matches between textstrings in said first array with text strings in said second array. 2)The method of claim 1) where finding all matches ignores the type caseof the text. 
3) A method for comparing two program source code files tohelp an expert determine whether one file contains source code that hasbeen copied from the other file or whether both files contain code thathas been copied from a third file, the method comprising a) eliminatingfunctional programming lines from the first source code file, leavingcomment lines; b) eliminating functional programming lines from thesecond source code file, leaving comment lines; c) substituting a singlespace character for sequences of whitespace characters in each remainingcomment line in said first file; d) substituting a single spacecharacter for sequences of whitespace characters in each remainingcomment line in said second file; e) putting each remaining comment lineof the first file into an array of text strings; f) putting eachremaining comment line of the second file into a second array of textstrings; and g) finding all matches between text strings in said firstarray with text strings in said second array. 4) The method of claim 3)where finding all matches ignores the type case of the text. 5) A methodfor comparing two program source code files to help an expert determinewhether one file contains source code that has been copied from theother file or whether both files contain code that has been copied froma third file, the method comprising a) extracting all words betweenwhitespace from each line of functional programming code in the firstsource code file to an array of text strings; b) eliminating programminglanguage keywords from said array of text strings; c) extracting allwords between whitespace from each line of functional programming codein the second source code file to a second array of text strings; d)eliminating programming language keywords from said second array of textstrings; e) finding all matches between text strings in said first arraywith text strings in said second array. 6) The method of claim 5) wherefinding all matches ignores the type case of the text. 
7) A method for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the method comprising a) extracting all words between whitespace from each line of functional programming code in the first source code file to an array of text strings; b) eliminating programming language keywords from said array of text strings; c) extracting all words between whitespace from each line of functional programming code in the second source code file to a second array of text strings; d) eliminating programming language keywords from said second array of text strings; e) finding all partial matches between text strings in said first array with text strings in said second array, where a partial match is where one string can be found in its entirety in a second string but the strings are not identical.
8) The method of claim 7) where finding all partial matches ignores the type case of the text.
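By way of illustration only, the partial match test of claims 7) and 8) can be sketched in Python as follows. The function name partial_matches is hypothetical, and the two input arrays are assumed to hold non-empty words from which keywords have already been eliminated.

```python
def partial_matches(first_words, second_words, ignore_case=False):
    """Return (a, b) pairs where one word is contained in the other but they differ."""
    pairs = []
    for a in first_words:
        for b in second_words:
            x, y = (a.lower(), b.lower()) if ignore_case else (a, b)
            # A partial match: one string found in its entirety in the other,
            # excluding identical strings.
            if x != y and (x in y or y in x):
                pairs.append((a, b))
    return pairs
```

For instance, under these assumptions "count" and "linecount" form a partial match, while "idx" and "idx" do not because they are identical.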
9) A method for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the method comprising a) eliminating programming comments from the first source code file; b) eliminating programming comments from the second source code file; c) substituting a single space character for sequences of whitespace characters in each remaining line of functional programming code in said first file; d) substituting a single space character for sequences of whitespace characters in each remaining line of functional programming code in said second file; e) putting each remaining line of functional programming code of the first file into an array of text strings; f) putting each remaining line of functional programming code of the second file into a second array of text strings; and g) finding sequences where the first word of each line in said first array matches the first word of each line in said second array.
10) The method of claim 9) where finding sequences where the first word of each line in said first array matches the first word of each line in said second array ignores the type case of the text.
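By way of illustration only, step g) of claim 9) can be sketched in Python as follows. The claims do not define how sequences are delimited or how long they must be; this sketch assumes a sequence is a maximal run of consecutive lines whose first words match, and the names first_words, matching_sequences and min_len are hypothetical. The input arrays are assumed to hold non-blank functional lines with comments already eliminated.

```python
def first_words(code_lines, ignore_case=False):
    """Return the first whitespace-delimited word of each line."""
    words = []
    for line in code_lines:
        parts = line.split()
        word = parts[0] if parts else ""
        words.append(word.lower() if ignore_case else word)
    return words


def matching_sequences(first_lines, second_lines, min_len=2, ignore_case=False):
    """Return (i, j, length) for each maximal run of lines whose first words match."""
    a = first_words(first_lines, ignore_case)
    b = first_words(second_lines, ignore_case)
    runs = []
    for i in range(len(a)):
        for j in range(len(b)):
            # Skip positions that merely continue a run started earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                runs.append((i, j, k))
    return runs
```

Each returned triple gives the starting index in the first array, the starting index in the second array, and the number of consecutive lines in the run.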
11) A method for comparing two program source code files, comprising: a) extracting from each program source code file a first set of code elements and a second set of code elements; b) computing a first metric derived from comparing the first set of code elements for the first program source code file to the first set of code elements for the second program source code file; c) computing a second metric derived from comparing the second set of code elements for the first program source code file to the second set of code elements for the second program source code file; d) combining the first metric and the second metric to derive a combined metric, wherein the first and second sets of code elements are selected from the group consisting of complete words, selected partial words, selected source lines, selected comment lines and selected code sequences.
12) An apparatus for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the apparatus comprising a computer; a source code matching program on said computer, wherein said source code matching program comprises: a) means for eliminating programming comments from the first source code file; b) means for eliminating programming comments from the second source code file; c) means for substituting a single space character for sequences of whitespace characters in each remaining line of functional programming code in said first file; d) means for substituting a single space character for sequences of whitespace characters in each remaining line of functional programming code in said second file; e) putting each remaining line of functional programming code of the first file into an array of text strings; f) means for putting each remaining line of functional programming code of the second file into a second array of text strings; and g) means for finding all matches between text strings in said first array with text strings in said second array.
13) The apparatus of claim 11) where means for finding all matches ignores the type case of the text.
14) An apparatus for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the apparatus comprising a computer; a source code matching program on said computer, wherein said source code matching program comprises: a) means for eliminating functional programming lines from the first source code file, leaving comment lines; b) means for eliminating functional programming lines from the second source code file, leaving comment lines; c) means for substituting a single space character for sequences of whitespace characters in each remaining comment line in said first file; d) means for substituting a single space character for sequences of whitespace characters in each remaining comment line in said second file; e) means for putting each remaining comment line of the first file into an array of text strings; f) means for putting each remaining comment line of the second file into a second array of text strings; and g) means for finding all matches between text strings in said first array with text strings in said second array.
15) The apparatus of claim 14) where means for finding all matches ignores the type case of the text.
16) An apparatus for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the apparatus comprising a computer; a source code matching program on said computer, wherein said source code matching program comprises: a) means for extracting all words between whitespace from each line of functional programming code in the first source code file to an array of text strings; b) means for eliminating programming language keywords from said array of text strings; c) means for extracting all words between whitespace from each line of functional programming code in the second source code file to a second array of text strings; d) means for eliminating programming language keywords from said second array of text strings; e) means for finding all matches between text strings in said first array with text strings in said second array.
17) The apparatus of claim 16) where means for finding all matches ignores the type case of the text.
18) An apparatus for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the apparatus comprising a computer; a source code matching program on said computer, wherein said source code matching program comprises: a) means for extracting all words between whitespace from each line of functional programming code in the first source code file to an array of text strings; b) means for eliminating programming language keywords from said array of text strings; c) means for extracting all words between whitespace from each line of functional programming code in the second source code file to a second array of text strings; d) means for eliminating programming language keywords from said second array of text strings; e) means for finding all partial matches between text strings in said first array with text strings in said second array, where a partial match is where one string can be found in its entirety in a second string but the strings are not identical.
19) The apparatus of claim 18) where means for finding all partial matches ignores the type case of the text.
20) An apparatus for comparing two program source code files to help an expert determine whether one file contains source code that has been copied from the other file or whether both files contain code that has been copied from a third file, the apparatus comprising a computer; a source code matching program on said computer, wherein said source code matching program comprises: a) means for eliminating programming comments from the first source code file; b) means for eliminating programming comments from the second source code file; c) means for substituting a single space character for sequences of whitespace characters in each remaining line of functional programming code in said first file; d) means for substituting a single space character for sequences of whitespace characters in each remaining line of functional programming code in said second file; e) means for putting each remaining line of functional programming code of the first file into an array of text strings; f) means for putting each remaining line of functional programming code of the second file into a second array of text strings; and g) means for finding sequences where the first word of each line in said first array matches the first word of each line in said second array.
21) The apparatus of claim 20) where means for finding sequences where the first word of each line in said first array matches the first word of each line in said second array ignores the type case of the text.
22) An apparatus for comparing two program source code files, comprising: a computer; a source code matching program on said computer, wherein said source code matching program comprises: a) means for extracting from each program source code file a first set of code elements and a second set of code elements; b) means for computing a first metric derived from comparing the first set of code elements for the first program source code file to the first set of code elements for the second program source code file; c) means for computing a second metric derived from comparing the second set of code elements for the first program source code file to the second set of code elements for the second program source code file; d) means for combining the first metric and the second metric to derive a combined metric, wherein the first and second sets of code elements are selected from the group consisting of complete words, selected partial words, selected source lines, selected comment lines and selected code sequences.
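By way of illustration only, the combining of metrics recited in claims 11) and 22) can be sketched in Python as follows. The claims do not specify how each metric is computed or how the metrics are combined; this sketch assumes each metric is a count of distinct shared elements and that the combination is a weighted sum. The names match_count and combined_metric, the dictionary layout, and the weights are all hypothetical.

```python
def match_count(first_elements, second_elements):
    """Metric for one kind of code element: number of distinct elements in both files."""
    return len(set(first_elements) & set(second_elements))


def combined_metric(first_file, second_file, weights):
    """Combine per-kind metrics into one score as a weighted sum (an assumption).

    first_file and second_file map a kind of code element (for example
    "words" or "lines") to the list of elements extracted from that file.
    """
    total = 0.0
    for kind, weight in weights.items():
        total += weight * match_count(first_file[kind], second_file[kind])
    return total
```

A pair of files could then be scored with, for example, weights of 1.0 for words and 2.0 for source lines, so that a matching line counts for more than a matching word.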